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Abstract 

The robust estimation of dynamically changing features, such as the position of prey, is 
one of the hallmarks of perception. On an abstract, algorithmic level, nonlinear 
Bayesian filtering, i.e. the estimation of temporally changing signals based on the 
history of observations, provides a mathematical framework for dynamic perception in 
real time. Since the general, nonlinear filtering problem is analytically intractable, 
particle filters are considered among the most powerful approaches to approximating 
the solution numerically. Yet, these algorithms prevalently rely on importance weights, 
and thus it remains an unresolved question how the brain could implement such an 
inference strategy with a neuronal population. Here, we propose the Neural Particle 
Filter (NPF), a weight-less particle filter that can be interpreted as the neuronal 
dynamics of a recurrently connected neural network that receives feed-forward input 
from sensory neurons and represents the posterior probability distribution in terms of 
samples. Specifically, this algorithm bridges the gap between the computational task of 
online state estimation and an implementation that allows networks of neurons in the 
brain to perform nonlinear Bayesian filtering. The model captures not only the 
properties of temporal and multisensory integration according to Bayesian statistics, but 
also allows online learning with a maximum likelihood approach. With an example from 
multisensory integration, we demonstrate that the numerical performance of the model 
is adequate to account for both filtering and identification problems. Due to the 
weightless approach, our algorithm alleviates the ’curse of dimensionality’ and thus 
outperforms conventional, weighted particle filters in higher dimensions for a limited 
number of particles. 


Author Summary 

Every day, our brain is facing the challenge of making sense of the rich and dynamical 
stream of sensory inputs. Those inputs are often ambiguous, noisy and sometimes even 
conflicting. That we are nevertheless able to make sense of our surrounding naturally 
points to the important question how estimates of real-world variables that led to 
perceptive input, e.g. the position or the velocity of an object, are formed. Further, it is 
unknown how the computational task of real-time state estimation can be implemented 
in a realistic neuronal architecture. Here, we propose an algorithm, the Neural Particle 
Filter, that performs state estimation, in a way that captures essential properties of 
perception: it takes into account prior knowledge of the environment, weights different 
sensory modalities according to their reliability and is able to dynamically adapt to 


1/23 





changes. Implemented as a neuronal dynamics, the Neural Particle Filter predicts 
activation properties of the neurons involved in perception. 


Introduction 


During the last decade, an increasing number of studies have stated that the brain 
performs probabilistic inference during perceptual tasks [^[^. As an act of 
(approximate) Bayesian inference, perception relies on noisy and incomplete data that 
needs to be integrated across multiple sensory modalities and weighted according to 
sensory reliability. In addition, perception makes use of the strong statistical regularities 
of objects in our environment by forming prior beliefs about the world. Since our 
environment is fundamentally dynamic, the ability to adapt to changes in real time is 
essential for perception. The Bayesian brain hypothesis is supported by ample 
experimental evidence, ranging from psychophysical findings [3j|^ to neuronal 
recordings |6]-[^ that are in line with Bayesian computation. However, most of the 
studies concerned with the theory of perception consider fairly simple tasks, where the 
observations are created either from static hidden variables or from hidden variables 


with a discrete state-space 10 , or the underlying dynamics are considered linear 11 12 


In a dynamical setting, where temporally changing signals have to be estimated 
online from the history of observations, Bayesian inference is commonly referred to as 
‘filtering’. In general, nonlinear Bayesian filtering is a challenging task even without the 
imperative of a plausible implementation on a neuronal architecture. If the prior 
distribution is a Gaussian and the noisy observations depend linearly on the hidden 
states, the inference problem is solved by the Kalman filter 13 14 , which has received 


substantial attention in the signal processing community and turns out to be of 
increasing importance in neuroscientific phenomenological modeling, e.g. in a 
sensorimotor integration task or in estimating motor disturbances from an adaptive 


gain 15 . Solutions for the more general nonlinear, i.e. non-Gaussian, filtering 


problem 16 17 are analytically intractable and thus have to be approximated. 


Sampling-based approaches have proven to be a powerful tool to solve the nonlinear 
filtering problem numerically. In principle, they enable any posterior distribution to be 
represented with an accuracy that depends on the number of samples. On the one hand, 


so called particle methods (see for instance 18 19 ) are well suited for dynamical priors, 
but suffer in high dimensions due to the degeneracy of the importance weights and it is 
still unclear how to implement such an inference scheme in a neuronal network. On the 
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other hand, Langevin sampling 20 21 and related techniques, such as the ‘fast sampler’ 


provide a promising ground for a biologically plausible implementation of neural 
but are restricted to static generative models. 


or synaptic sampling 23 24 


Following a sampling-based approach, we propose a framework for how the brain 


could perform filtering from noisy sensory stimuli, considering Marr’s three levels 25 
the computational level, the algorithmic and representational level and the 
implementation level. On the first level, the computational task of dynamical state 
estimation is set in the context of continuous-time continuous-state nonlinear filtering 
theory. Motivated by this rigorous mathematical theory, we propose a weight-less 
particle filter, the Neural Particle Filter (NPF), that approximates the posterior at each 
time step by sampling from it. This algorithm can further be tuned by maximum 
likelihood learning and thus allows for rigorous corrections in the algorithmic ansatz, as 
well as learning the model parameters. The NPF exhibits properties that are considered 
crucial for perception. On the implementation level, we interpret the NPF as a 
biologically plausible neuronal dynamics and identify the particle states with activities 
of task-specific neurons. 


23 



































Results 


The results we are presenting are subdivided in two parts: first, we will introduce the 
Neural Particle Filter as a conceptual result. This first part will cover the first two of 
Marr’s three levels, namely i.) the computational level with a generative model layout 
and a task description, and ii.) the algorithmic level, which outlines our choice of 
representation and the approximate solution to the nonlinear filtering problem that is 
based on this representation. In the second part, we demonstrate key properties of the 
NPF, and we we illustrate how they might serve as a model for a neuronal dynamics 
involved in perception. 


Model 


Nonlinear filtering as a generic computational task We formulate the 
computational task in terms of the classical filtering problem (according to standard 
literature on nonlinear filtering, e.g. 26 27 ). The hidden stat^Xj G K", i.e. the 


real-world variable that the brain cannot access directly, follows the Ito stochastic 
differential equation (SDE): 

dXf = f(xt)dt + S^^^dLi!t, 


( 1 ) 


with a nonlinear, deterministic drift function f(x) : M” —)■ K". Stochastic diffusion is 
governed by the uncorrelated Brownian motion procestj^u;* S K" with noise covariance 

E,, e K” X M”. 

At each moment in time, the hidden state Xj gives rise to noisy observations 
ys G that represent sensory input. The observation dynamics is again modeled in 
terms of an ltd diffusion, with a drift term following the hidden states via a generative 
function g(x) : K” -G- and a Brownian motion diffusion, modulated by the sensory 
noise covariance Ey G K™ x K"*: 


dyt = g{^t) dt + di^t 


( 2 ) 


Together, Eqs. Q and ([^ define a generative model. 

Solving the filtering problem is the task of finding the posterior probability p(xt\yt) 
of the hidden state, conditioned on the whole sequence of observations 
yt = {ys, s G [0, t]} up to time t. For a linear hidden dynamics f(x) and a linear 
observation dynamics g(x), this task is solved by the Kalman-Bucy filter 14 , which is a 


continuous-time version of the well-known Kalman filter. However, the solution to the 
nonlinear filtering problem is in general analytically intractable, because it suffers from 
the so-called closure problem (see |Sl Appendix ). Therefore, introducing a suitable 
approximation is an inevitable step when approaching the nonlinear filtering problem. 


Sampling-based representation We approximate probability distributions in 
terms of a finite number of variables. For example, this can be achieved by taking N 
weighted samples: 

N N 

~ ^w;fc(5(x-xf^), with ^ Wife = 1. (3) 

k—1 k—1 

Thus, the probability of the random variable to have a certain value range is 
proportional to the relative number of samples within this range, weighted by their 
respective weight Wi. 

^For consistency, vectors will be printed in bold face, i.e. v = {vi,V2, ■ ■ ■ )^. 

^For Brownian motion processes: (doitdaij) = dt if t = s, otherwise {dLJtdtjJ) = 0. 
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Filtering algorithms representing the posterior in this sampling-based manner are 
commonly referred to as particle filters. In standard particle filters (such as outlined 

(k) 

), update rules for the trajectories xj , as well as the weights Wk are given. 
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Despite asymptotic convergence to the true posterior for an infinite number of particles, 
this approach has two disadvantages: First, one finds numerically that after a finite 
number of time-steps most particle weights decay to zero, which depletes the number of 
effective samples. Weight decay, or degeneration, is an undesirable trait of weighted 
particle methods in general. As stated in the convergence theorem 29 Theorem 23.5], 
the upper bound of the divergence between true posterior and the posterior estimated 
by the weighted particle system is a function of time, and hence might be growing due 
to the weight decay. Second, the problem is exacerbated if the number of dimensions of 
the hidden state Xj is large. In this case, the number of particles needed for good 
numerical performance grows exponentially with the number of dimension, a variant of 
the ‘curse of dimensionality’ 


30 


In the theoretical neuroscience literature, sampling-based approaches for filtering 
with a representation of the posterior as in E g. ([^ have not received much attention so 
far (one of the few examples can be found in [31| ), although they have some 
experimental support [7, 32 and are considered relevant according to the neural 


sampling hypothesis 33 . Therefore, we would like to explore this approach further. To 


overcome the difficulties encountered with weighted approaches, we consider a particle 
filter with equally weighted samples, i.e. Wk = 1/iV Vfc. 


Filtering with the Neural Particle Filter As an inference algorithm, we propose 
an SDE that governs the dynamics of particles z^. Let us consider N i.i.d. stochastic 

(k) 

processes z] , fc = 1,..., N, conditioned on the observations following the Ito 
diffusion 

dzf^ = f (zf ^) dt-k Wt (dyt - g(zf ^) dt)-tsy^dwt, (4) 


where Wj S K” is an uncorrelated vector Brownian motion process and Wt is a 
time-dependent gain matrix or decoding weight matrix. 

Equation Q, which we will further refer to as the Neural Particle Filter (NPF]|^ is 
an ansatz that serves as a sampling-based approximation to the nonlinear filtering 

(k) 

problem: we consider each of the N stochastic processes z] ’ as an independent sample, 
or particle, of the true posterior p(xt\yt) at every time t. Thus, expectations from the 
posterior are computed according to E[0(xt)|3^t] « ((/)(xt)) = l/NJ2k 

This ansatz is motivated by the formal solution to the filtering problem, more 
precisely by the dynamics of the first posterior momenlj^ and shares some important 
properties with classical filtering methods: First, it is governed by both the dynamics of 
the hidden process x^ and by a correction proportional to the so-called innovation term 
dnt = dyt — g(zt) dt. The innovation term compares the sensory input dyt with the 
current prediction g(zt) dt according to the single particle position, and thus can be 
seen as a predictive error signal |^. Second, the gain matrix Wt determines the 
emphasis that is laid on new information via observations dyt- This is conceptually 
similar to a Kalman gain 13 m for a linear model, and adjusts according to the 
reliability of a single or multiple observations. 


^In the consecutive section, we will identify the particles with neuronal activities, which is why we 
call it Neural Particle Filter. Though the name is similar, our NPF is not to be confused with the 
‘neural filtering’ approach in [34] , which is an unsupervised learning algorithm in an artificial neural 
network; _ 

^See [^ Appendix] for an outline of the formal solution and the dynamics of the first posterior 
moment. 
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The gain introduces a weighting between the prior probability distribution p{:x.t) 
induced by Eq. Q, and the likelihood function p{yt\xt) induced by Eq. ([^ and thus 
serves as a measure for the peakedness of the likelihood. If the decoding weight is large, 
the dynamics in Eq. Q will entirely be determined by the innovation term, and the 
inter-particle variability governed by the diffusion term will be negligible. Therefore, the 
resulting probability distribution is given by p{xt\yt) « p{zt\yt) S(zt — g~^(^)). In 
this limit, the deterministic observation limit, a single sample from Eq. @ suffices to 
represent the posterior. On the other hand, if the decoding weight is zero, new 
information is disregarded, and each sample evolves just like an i.i.d. process from 
Eq. Qj^In this case, the resulting probability distribution simply equals the stationary 
prior distribution p(xt). 

For the gain Wt, we use the ansatz Wt = cov(xt, g(xt)’^)E“^, an empirical choice 
motivated by the mean dynamics of the formal solutior0 This gain adjusts according to 
the observation noise Sy as well as to the spatial ambiguity as measured by the 
empirical, i.e. instantaneously estimated from the particle positions, covariance between 


the state Xj and the observation function g(xt) (Eq. 13 in Methods). Although this 


choice is rather heuristic, it achieves a numerical performance comparable to that of a 
standard particle filter (PF), as demonstrated below, and is moreover straightforward to 
implement by empirically estimate the covariance from the particle positions. 


Parameter learning In a more general setting, model parameters of Eqs. Q and ([^ 
may not or only partially be known, and thus need to be learned online from the stream 
of observations yt- In this case, the NPF algorithm can be extended to include a 
parameter update that performs an online gradient ascent on the log likelihood 




dyt - ^(g(xt)) 


^(g(xt)) dt, 


( 5 ) 


which in turn is computed directly from the approximated filtering distribution itsel:Q 
It can be shown that maximizing this log likelihood is equivalent to minimizing the 


prediction error in continuous time (see SI Appendix). 

Further, not only the model parameters in Eqs. (|^ and (i). but also the decoding 
parameters, i.e. components of the decoding weight or gain matrix Wt-, can be learned 
with a maximum likelihood approach, instead of setting it according to the empirical 
estimate from the particle positions. This alternative corrects for the heuristic ansatz of 
the NPF equation by determining the decoding weights rigorously. In fact, it can be 
shown that parameter learning with a maximum likelihood approach is able to make up 


even for a very poor filtering ansatz by setting parameters accordingly 35 


The Neural Particle Filter as a neuronal dynamics for 
perception 


In this section, we set the computational task in the context of perception and base the 
implementation of the algorithm on a neuronal architecture. With a simple example, we 
now illustrate that our algorithm captures the following key properties of perception [^: 
1) it relies on noisy and incomplete sensory data, 2) it uses prior knowledge of the 
dynamic structure of the environment 3) it efficiently combines information from several 
sensory modalities, and 4) it can dynamically adapt to changes in the environment. 


®See footnote 1^ 
®via (g{xt)} AT-l 


Eq. @ in 


Methods 
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Multisensory perception as filtering 

Consider a frog who sits below two branches and observes an insect flying between the 
two branches (Fig. [^). The frog wants to track the position of the insect Xt, which is 
governed by 


dxt = 'ixt{l — xf) dt-\-dujt, (6) 

where the Brownian motion process LOt accounts for noise due to the erratic behavior of 
the insect. This dynamics gives rise to a bimodal stationary distribution for the position 
of the insect (cf. Fig. &)■ 

The frog cannot directly observe the state Xt of the insect, but instead has to rely on 
two sensory channels, a visual {vt) and an auditory (a^) channel. Observation dynamics 
in these channels are given by 

dvt = Xt dt + cTydldt, (7) 

dat = tanh{2xt) dt + aadjt, (8) 

where are independent Brownian motions that model noise in the sensory 

channels, making vt and at conditionally independent. The nonlinearity in the auditory 
channel (Eq. is motivated by the fact that sound localization depends on interaural 
difference, which resembles a sigmoid in this example. In order to localize the fly, the 
frog has to perform the task of nonlinear filtering and to compute 
Xt = ]E[a;t|r!s, Os, 0 < s < t], i.e. the position of the insect, from the visual and auditory 
sensory streams. Note that due to the nonlinear dynamics of the hidden and observation 
processes, this example is analytically intractable and thus requires an approximation. 

We propose that this task is solved by a set of N filtering neurons zj, i = 1,..., N. 
Their neuronal dynamics are given by the NPF 0 and for this particular example read: 

dz^"^ = (l - (zf V) dt + di^t'^ 

+Wt'"'^ {dvt — Zj*^ dt) + {dat — tanh(2z|*^) dt), (9) 

which is governed by the dynamics of the prior as well as corrections evoked by novelty 

(v) 

of the observations in the sensory channels, that are modulated by gains Wt and 
Wt°‘\ Thus, our model readily captures the first two key properties of perception. 

The empirical distribution of neuronal activities Zj*^ approximately samples the 
posterior distribution, thereby acting as a weight-less particle filter that successfully 
tracks the position of the insect (Fig. 0 )- The state estimate Xt (posterior mean) can 
be read out from this population by averaging the activities of the filtering neurons, 
i.e. Xt « (zt) = 

The potential of having a description of the full posterior stretches far beyond 
simple state estimation, where one is only interested in the first moment. Particularly 
the sampling-based approximation of this posterior allows a convenient estimation of 
other relevant quantities. For example, the frog might want to know on which branch 
the insect is sitting in order to catch it more easily. The frog could directly deduce a 
certainty level for the left and right branch, respectively (Fig. by counting the 
number of neurons within a certain activity range. 

In a similar manner, higher-order moments of the posterior distribution can be 
approximated with the samples that correspond to neuronal activities. Even though 
these approximated moments are not exact (Fig.[^), the overall posterior shape is 
captured to a considerable extent. For some nonlinearities, our proposed model is 
therefore superior to models relying on an approximation of just the first two moments 
of the distribution. For instance, the Extended Kalman Filter (EKF) does by definition 
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Figure 1; The NPF as a model for perception for multisensory perception. 

(a) The frog has to do filtering in order to dynamically track the position xt of the insect, 
which is switching between two branches. We model this behavior by the stochastic 
dynamics in Eq. §, which gives rise to a bimodal prior distribution p{xt = x). In order 
to track the insect, the frog has to make use of this prior, and optimally combine it with 
the noisy input from the sensory modalities Vt and a*, (b) Tracking simulation with 
N = 1000 filtering neurons and sensory noise cr^ = = 0.1. The upper panel shows 

the true trajectory of the insect (solid black line) and particle densities. The regions 
between the dotted black lines denote the two branches, and certainty levels certainty 
levels in the middle panel correspond to the relative number of particles whose states are 
within one of the two branches. At each time, the sensory gains in the lower panel are 


computed according to Eqs. (10) and (11). (c) Performance in terms of time-averaged 


mean-squared error MSE = {{xt — XtY)t (dotted lines: standard particle filter) and (d) 
time-averaged gains (solid line) and (lF/“^)t (dashed line) as function of sensory 

noise. 


not capture higher order moments of the posterior and thus fails to reproduce its 
features}^ This explains why the predictive performance of the EKE in terms of MSE is 

^For larger observation noise, matters become even worse because the state estimate of the EKF 
evolves to one of the fixed points of f(x) and remains there, irrespective of the real hidden state. 
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(a) (b) 


Bias of estimated variance S 



(c) 


Figure 2: MSE and estimation of posterior variance from single-cue observa¬ 
tion. Simulations correspond to the example model with only (a) visual or (b) auditory 
cue, i.e. a generative model with Eqs. (§ and Q or Q, respectively. The gain in the 
NPF is tuned according to Wt = cov(xt, g(xt)^)E“^ and according to a gradient ascent 
on the log likelihood (‘ML’), respectively For benchmarking, we use a standard PF. The 
performance of the NPF is nearly indistinguishable to that of the PF. In addition, we 
compare the performance of an EKF. (c) The posterior variance E on average exhibits 
a bias Bias(S) = S — Tspp when compared to that of the PF. Here, solid (dashed) lines 
refer to the visual (auditory) channel. 


fairly poor compared to that of the NPF (Fig. [2}i,b). 

Cue integration The decoding weights, or gain factors, and wj°‘\ are 

exemplary for essential multisensory integration. They balance the relative effects of the 
two sensory modalities and the prior on the dynamics of the filtering neurons and thus 
quantify the reliability of the sensory channels. Here, we consider 
Wt = cov(xt, g(xt))E“^ in the NPF. Thus, in this example, the weights evaluate to 

H^/“^ = YaT{xt)a-^ (10) 

= Cov(a;t,tanh(2a;t))cr“^, (11) 

where posterior variances and covariances are estimated empirically from the neuronal 
activity distribution. The gains adjust according to both the sensory noise levels 
(Fig#) and the spatial ambiguity evoked by the sigmoid observation function for the 
auditory channel at (Fig. [^). In particular, the gains become large if a channel is 
particularly reliable, and in extreme cases dominate the dynamics of the filtering 
neurons, corresponding to the deterministic observation limit discussed earlier. The 
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appropriate weighting of sensory information allows the neurons to solve the filtering 
task near-optimally and comparable to a standard PF, which is reflected by our 
simulation results in Fig. and 


Adapting to changes In our example, the frog could successfully track the position 
of the insect, but it could only do so because it had access to the generative model 
parameters, i.e. it knew the prior dynamics of the insect and it it was aware of how the 
sensory percepts were generated from the true state of the insect. Also, knowledge of 

(-u) 

these model parameters were crucial for determining the sensory weights TFy ^ and 

and thus significantly influenced the dynamics of the filtering neurons. However, 
the external world, i.e. the model parameters, does change over time, and successful 
perception should adapt accordingly, i.e. the model parameters should be adjusted by 
online learning from the stream of observations 

We illustrate the learning of generative model parameters using our example (see 


Methods and SI Appendix for details). This time, the frog only relies on his visual 


channel Vt, but in addition to tracking the insect, he also has to learn the generative 
factor J in the function g{x) = Jx, which relates the position of the insect to the visual 
input. Simultaneously, it also learns the gain and with that implicitly an estimate 
of the reliability of its visual input. Figure]^ shows that this identification problem can 
be solved efficiently by the NPF, with an MSE that gradually approaches that of the 
benchmark (a standard PF with the ground-truth parameters) as the estimate of the 
parameters get more accurate (Fig. i^). We find this to be true over a wide range of 
observation noise Values of the estimator J, i.e. the learned value of the generative 
factor, tend to exhibit a slight negative bias, but for an observation noise of up to 
T,y = 0.1 still stay in a 2%-region below the true generative weight. 

It is noteworthy that the learning rule for the generative factor J can be simplified 
substantially for small observation noise: 




{{dvt — dt) ■ 


( 12 ) 


Our findings suggest that this learning rule, which we will refer to as ‘Hebbian’ for 
reasons we illustrate below, leads to an estimator J for the generative weight J which is 
slightly negatively biased across initial conditions (Fig. |§). The absolute value of this 


bias decreases for smaller observation noise Ey and the learning rule in Eq. (121 
becomes exact for 'Ey 0. Moreover, this bias does not seem to affect the filtering 
performance as measured by the MSE (Fig. 


Neuronal Implementation 

Recurrent neuronal dynamics In our example, we have interpreted the NPF as 
the neuronal dynamics of a population of A x n filter neurons , whose neuronal 
activities represent samples of the posterior, which is in line with the neural sampling 
hypothesis 33 . Thus, analog neuronal activities are identified with the continuous 


particle state, for instance in terms of their instantaneous firing rate. The internal 
dynamics, or self interaction, of the filtering neurons is governed by the nonlinear 
function f(zt), which incorporates prior knowledge as a state-dependent leak. In 
addition, they are recurrently connected to populations of novelty neurons Uj via the 
feedforward synaptic weights Wt and feedback connections whose strength is governed 
by the nonlinearity g(zt)- The dynamics of the novelty neurons are governed by the 
innovation term dut = dyt — g(zt) dt. As input to this network we consider a neuronal 
population yt whose rates are evoked from the underlying hidden stimulus X( via the 
generative dynamics in Eq. (© (Fig.i. 
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Figure 3: Model parameters are learned by a stochastic gradient ascent on 
the log likelihood. Simulations shown here correspond to the example model with 
only a visual cue, i.e. a generative model with Eqs. and 0. The sensory gain in 
the equation for the filtering neurons is tuned according to maximum likelihood, the 
decoding weight is learned either by maximum likelihood (ML) or the Hebbian learning 
rule (Hebb) in Eq. (12), which is a valid approximation for small sensory noise. As 
benchmark, we use a standard particle filter with the true model parameters. For all 
parameters, learning starts at t = 0. (a) Filtering and online learning of the generative 
weight J = 1 and sensory gain W for sensory noise = 10“^. As J approaches the true 
value, the MSE of the NPF resembles that of the standard PF. (b) Bias of estimated 
generative weight J. (c) The filtering performance is not affected by the bias of J. 
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novelty neurons 
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Figure 4: Neural Network Implementation. Implementation of Eq. Q as a recurrent 
neuronal network. Here, yj, and z* denote neuronal populations. 


Hebbian learning In this interpretation, W corresponds to the matrix of synaptic 
weights that connects novelty neurons n to filtering neurons z). . If the generative 
function g(xt) = Jx* is linear, then J denotes the matrix of feedback weights which 
connects filtering neurons to novelty neurons. In general, the learning rules for these 
weights are not local, i.e. they rely on the state of the whole network (cf. Eqs. 22 and 


241. However, in the deterministic limit the learning rule for the generative weight 


matrix can be replaced by a learning rule that is both Hebbian and local and relies on a 
multiplication between pre- and postsynaptic activity (Eq. [^and, more generally, 

Eq. 26 see Methods). Eurther, for small observation noise, Wt can be replaced by a 


constant matrix without affecting the filtering performanc^ Therefore, at least in this 
limit, the network presented in Eig.j^is implementable as a neuronal dynamics of a 
recurrent network with local Hebbian synaptic plasticity. 


The NPF alleviates the ‘curse of dimensionality’. In our example the frog had 
to estimate only a single hidden state, namely the one-dimensional position of the insect. 
In a more realistic setting, there is a large number of hidden states, ranging from the 
position of an object in three-dimensional space to the relative presence of a features 
making up a visual scene. Therefore, any filtering algorithm employed by a neuronal 
population in the brain should be economical in its resources, i.e. the number of 
neurons needed to solve the filtering task to a certain performance level should scale 
well with the number of hidden variables. The NPF, in particular when the decoding 
weight Wt is determined with maximum likelihood, is able to solve the filtering task in 
higher dimensions with just a limited number of filtering neurons. It thus alleviates the 
curse of dimensionality, which would be devastating for a realistic implementation. 

We want to illustrate this point numerically with a toy exampl^ comparing the 
filtering performance of the NPF in terms of its MSE for very small number of particles 
to that of a standard PE (Fig. [^. In general, we find that the NPF performs well even 
for a limited number of samples. If we further allow the NPF to learn its decoding 
weight matrix Wt by maximum likelihood, the filter is able to solve the filtering task 
with a single particle almost as good as with more particles. On the other hand, in a 
single-particle scenario the NPF using Wt = cov(x(,g(xt))E“^ as well as the standard 
particle filter unsurprisingly exhibit an MSE, which corresponds to an independent 
trajectory of the prior. Upon increasing the number of samples, the PF outperforms the 
NPF again, owing to the approximations employed by it. However, this crossover-point 
tends to move towards larger number of particles for higher dimensions (e.g. Fig.j^, 
which is also not surprising: The PF trajectories evolve according to the prior, and in 
higher dimensions only a few particles will be in the correct spatial domain fitting to 


8 

9 


as long as the weights are large compared to the prior dynamics 


Methods 


for numerical details 
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Figure 5: The NPF alleviates the ‘curse of dimensionality’. Toy example with 
nonlinear hidden dynamics and linear observation dynamics. Solid lines: D = 5 hidden 
dimensions, dashed lines D = 1 hidden dimensions for comparison. The error indicates 
the mean Euclidean distance of the estimated state (xt) from the real hidden state X(, 
normalized by the trace of the prior variance to make the error dimension-free. The 
black marker indicates the approximate cross-over point between the Neural Filter with 
gain matrix Wt obtained by maximum likelihood and particle filter in D = 5 (solid) and 
D = 1 (dotted) hidden dimensions, which tends to move towards larger particle numbers 
as the dimensionality grows. Note that the number of filtering neurons is not equal to 
the number of particles N, but rather to D ■ N. 


the observation and will thus have a non-vanishing weight 

The robustness of performance for smaller number of particles, which we observe in 
the NPF, is mainly due to the direct influence of the observations dyt on the 
trajectories of the samples that correspond to the neuronal activities. In the NPF, each 
neuronal activity itself can be seen as a ’’mini approximation” of the true posterior in 
terms of a 5-function, an approximation which becomes exact for very small observation 
noise Of course, the larger Ey becomes, the less the true posterior resembles a 
5-function and the more particles are needed to account for its shape in general. 


Discussion 


In this paper, we formulated the computational task of nonlinear Bayesian filtering. 
Based on the theory of nonlinear filtering, we proposed an Ito SDE for the posterior 
process and derived a learning rule for the parameters of this filter itself as well as for 
the generative weights of the underlying generative model. We have thus put forward an 
algorithm that allows for approximate filtering in continuous time for continuous-valued 
hidden processes. This algorithm allows the hidden dynamics as well as the observation 
dynamics to be nonlinear, and thus our model is flexible in representing a large class of 
general signal and emission statistics. The sampling-based framework is a central aspect 
of our NPF and as such is well in line with the ‘Neural sampling hypothesis’ 


33 


Besides being an ansatz that is broadly consistent with neuronal dynamics, the 
neural filter equation we propose in Eq. Q is particularly suited to model perception 
phenomenologically, because it shares some important properties with perception. First, 
perception relies on noisy and incomplete sensory data, and uses these to make sense of 
the world, which in our model is reflected by inferring the hidden state variable. Second, 
because prior dynamics directly enter the neuronal dynamics, prior knowledge about the 


'^®Note that this is not the weight decay due to iterative re-weighting, but rather a simple consequence 
of importance sampling in high dimensions, illustrating the ‘curse of dimensionality’ for weighted particle 
methods. 
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environment is automatically incorporated and can in principle be learned. Third, 
information from different sensory modalities is efficiently combined as a weighted input 
to the population of filtering neurons. Lastly, perception can adapt to changes in the 
environment, which is taken into account by a dynamical gain and online parameter 
updates. 

The implementation on a biologically realistic architecture imposes constraints on 
the algorithm itself as well as on how we can interpret its elements and structure. We 
should always be aware that these constraints describe a highly simplified version of the 
real biological underpinning, but have successfully been applied in network models to 
qualitatively understand core computations in the brain 36 p. 229ff]. First, neurons 


communicate among each other via discrete spikes, i.e. a digital signal. In contrast, we 
use the term ‘neuronal activity’ to denote an analog quantity. In some cases, for instance 
for a large number of neurons 36, p. 231], this ‘analog quantity’ may for instance 


correspond to an instantaneous firing rate. To take into account ‘negative’ neuronal 
activities, we could also consider deviations from a baseline firing rate or the membrane 
potential of the neuron, or the logarithm of the firing rate. Secondly, computations in 
and between neurons are performed through a weighted sum of inputs from cells they 
are connected to, and these inputs may be a (nonlinear) transformation of the 
presynaptic activity. Third, a hallmark of neural circuits is the synaptic connectivity 
between the neurons, the connection strength of which quantified by connective weights. 
These synaptic weights are modified by learning rules, which in the most simple case are 
local, i.e. they depend on the pre- and the postsynaptic neuronal activity. The learning 
rules in our model are in general not local, and in fact, each filter neuron has to 
know about the state of every other filter neuron and/or novelty neuron. Apart from 
that, when parameters are learned online, it is not clear how the filter derivative should 
be implemented in the network. However, we have shown that the learning rules of our 
model become both Hebbian and local for small observation noise, making these 
learning rules biologically plausible in this limit. Lastly, because the number of neurons 
in the brain is finite, computations clearly have to rely on a finite number of neurons, a 
fact we are taking into account by representing probability distributions with samples. 
Because these requirements are met by our proposed network structure, we consider our 
neuronal dynamics for filtering to be in line with standard network models. 


Comparison to related work 

The NPF is a filtering algorithm for a continuous-time continuous-state generative 
model with nonlinear hidden and observation dynamics. Filtering algorithms based on 
linear generative models have been subject to extensive research, and mainly study how 
the analytical solution to this problem, the Kalman filter, can be implemented with 
neurons (e.g. 11,12 31,37 ). However, the posterior resulting from a Kalman filter is 


always Gaussian, which is highly restrictive and does not properly reflect activity 
distributions observed in neurons (compare for instance the observations related to 
sparse coding as in 38 ). Unlike the various extensions of the Kalman filter 13 - such 


as the EKF or the unscented Kalman filter - which are applied to nonlinear 
systems, the NPF is not restricted to approximate the posterior by a Gaussian 
parametrized by its first and second moment. Rather, due to the nonlinearity in the 
network dynamics it may represent any probability distribution at any given time step. 

An important aspect of our work is the sampling-based representation of probability 
distributions, whereby the activity of each neuron is considered a single sample. There 
are two main competing proposals about how the probability distributions underlying 
Bayesian computations might be represented in the brain. Firstly, it has been suggested 
that probability distributions are expressed as probabilistic population codes 
(PPG 1^), in which each neuron represents a state of the encoded random variable and 
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their activities are proportional to the probability of the corresponding state. Filtering 
approaches based on population codes, in which the neuronal activity directly relates to 
the posterior or the log posterior, have been explored in the literature for a large set of 
models, e.g. 11 12 41 42 . In this representation, neurons directly correspond to the 


parameters of the distribution, and thus the critical factor for accuracy is the number of 
neurons. Further, they all suffer from the ‘curse of dimensionality’ for multimodal 
distribution. The second proposal, called neural sampling hypothesis 


33 


uses an 


inference scheme where the activity of each neuron represents a sample from the 
underlying probability distribution. Since our filtering algorithm is based on unweighted 
samples, our findings are in line with the advantages of the sampling-based 
representation outlined in 33 : it can represent any distribution without the need for a 


parametric form, it mitigates the ‘curse of dimensionality’ and it is well-suited for 
learning. Filtering approaches implementing Markov-chain Monte Carlo (MCMC) 

but since they rely on a discrete 


algorithms have received some attention lately 43 44 


the 


state space and assume a different coding scheme than the one suggested in [33| 
advantages listed there do not necessarily emerge from these models. 

As a filtering algorithm, the NPF is well in accordance with existing sample-based 
filtering approaches. Our ansatz may be seen as a particle filter where all particles carry 
the same weight and which, therefore, avoids numerical pitfalls such as weight 
degeneracy. This problem is notorious in standard MCMC particle filters 


18 and 


becomes even more severe as the number of hidden dimensions grows. The curse of 
dimensionality, i.e. the exponential growth of approximation error with the dimension of 
the underlying model, is an inevitable nuisance in standard MCMC approaches. There 
are some tricks to deal with these problems, for instance by particle resampling or using 
a more refined propagator for the particles (like the optimal importance function 


18 ), but neither solution is able to properly circumvent weight decay in general. 


Moreover, there is currently no proposed implementation of weighted particle methods 
in a neural architecture. For instance, the need to renormalize the weights at each time 
step introduces a coupling between the particles: While their trajectories are 
independent, their weights are certainly not. On the other hand, the neural filter, not 
relying on importance weights in the first place, does not suffer from these numerical 
pitfalls and their related implementational issues. The curse of dimensionality seems to 
be avoided, or at least mitigated, by the fact that the observations directly enter the 
particle trajectories. However, the particles following the NPF dynamics are not 
completely independent either. Coupling between the particles is mediated by the 
decoding weight matrix Wt, whose learning rule is influenced by all the particles. This 
could be avoided by fixing Wt to a constant consistent with observational noise Ey, 
e.g. after learning. Even if it is numerically a bit off what we think the ‘real’ decoding 
weight should be, the filtering performance is not seriously affected and particle 
trajectories are effectively decoupled. 

In the literature, there have been other approaches for particle filtering without 
importance weights, derived rigorously from mathematical filtering theory 45 46 . One 


of these approaches, the so-called feedback particle filter 45 , is based on a similar SDE 
for the particle trajectories as the one we propose in Eq.^). It can be shown that the 
underlying distribution of the feedback particle filter evolves exactly according to the 
Kushner equation, whereas our approach merely approximates it. However, the 
computation of the gain function in the feedback particle filter needs access to the full 
filtering distribution itself, and in order to avoid numerical issues their algorithm relies 
on a regularization scheme that would be hard to justify biologically. Though formally a 
multivariate version of the feedback particle filter exists, the gain function cannot be 
solved for in closed form. The Neural Eilter, though not an exact particle algorithm, 
overcomes this drawback by being readily applicable in higher dimensions and by being 
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comparatively easy to compute. 

The neuronal network structure (Fig. we propose to implement the neuronal 
dynamics according to Eq. Q is structurally similar to the one proposed in [^. As in 
their model, we represent neuronal activities in terms of their instantaneous firing rate, 
which is an approximation to the spiking nature of biological neurons. In their 
predictive coding model, a central role is assigned to the predictive error signal, which 
can be compared to the dynamics of the novelty neurons or novelty signal dwt in our 
model. Accordingly, equations for the neuronal dynamics and for learning the 
generative weight in the small observation noise limit is structurally similar. However, 
our model generalizes the one in in the sense that we allow dynamics in the prior, 
which is directly reflected in the dynamics of the filtering neurons. 


Implications 

The three central aspects of our work, namely a sampling-based representation, a 
filtering algorithm with adaptive gain and the structure of the recurrent neuronal 
network, result in the following implications for the neuronal network: 


1. The network is robust against neuronal failure. 

2. An internal model about the world becomes manifest in internal neuronal 
dynamics 

3. Neuronal variability is tuned according to sensory reliability. 

4. Neurons may code for the novelty given by discrepancy between prediction by an 
internal model and actual observations. 


The first implication follows directly from the sampling-based representation, namely 
robustness against neuronal failure. For example, if a distribution is represented by 1000 
particles, removing 10% will not significantly decrease the ability of the other particles 
to represent the probability distribution. In an extreme case, we could consider a single 
neuron to represent the whole distribution, given that its activity state can take values 
in the same range as the hidden state and we allow it to sample the distribution in time. 
Apart from that, as we have seen numerically, the ability to perform filtering with a 
reduced number of particles is not affected by particle removal to a large extent either, 
at least not in the particular algorithm we propose. However, some degree of plasticity 
or rewiring would be necessary in order to read out expectations from the decimated 
neuronal population. On the other hand, in parametric representations such as a PPG, 
where each neuron determines the height of a particular tuning curve assigned to it and 
that actually rely on the tuning curves to cover the space densely 40 , neuronal failure 


can be devastating: With each neuron that breaks down, a particular point in state 
space cannot be represented directly anymore. Clearly, a single neuron would never be 
able to account for any other distribution than the one resembling its own tuning curve. 

The second and third implication is a consequence of the adaptive gain, i.e. the 
decoding weight Wt, that determines the emphasis that is laid on new observations. As 
we have demonstrated, the gain Wt in our model increases with sensory reliability both 
according to sensory noise and ambiguity in the input generation, putting more 
emphasis on the observations versus its internal model. In the absence of observations, 
observation noise is maximal and the neuronal dynamics follow those of the hidden state 
which comprises an internal model about the world. With availability of observations, 
sensory reliability naturally increases and variability across samples should decrease, 
because now their dynamics are influenced by the same stimulus via the gain Wt- 
Indeed, it has been found that spontaneous neuronal activities relate to prior 
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expectations about a stimulus in visual cortex [32| . Further, it has been shown that 
inter-trial variability of neuronal responses declines upon stimulus onset [^. Both these 
experimental findings are nicely in line with our theoretical predictions. 


Conclusion and Outlook 

With the Neural Particle Filter we have come up with an algorithm that allows neurons 
to perform nonlinear Bayesian filtering in a sampling-based manner. Specifically, the 
neuronal implementation is based on a network of recurrently connected analog neurons 
whose dynamics are governed by the NPF algorithm. In future work, the biological 
plausibility of this recurrent network model will be further addressed. First, observing 
that the learning rules in general, i.e. for nonvanishing observation noise, do not fulfill 
the requirement of locality, which is needed for a biologically plausible learning rule (e.g. 
a Hebbian learning rule), the model could be enhanced such that individual filtering 
neurons obey different rather than identical dynamics. We could for instance consider 
N different subnetworks and locally determine the weights in these subnetworks, 
possibly taking into account a (slowly-changing) global modulation factor. Such an 
approach would also effectively decouple the particles on a smaller timescale. Second, 
by including the theory of filtering and identification of point processes, our algorithm 
could be extended such that a spike-based representation may be accounted for. 


Methods 


Details on numerical experiments 

The choice of the decoding weight Wt 

In our numerical simulations, we consider and compare two choices of the decoding 
weight Wt, which we will quickly motivate here. 

As a first choice, we consider Wt = cov(x, g(x)^)I]“^, a choice inspired by the 
dynamics of the formal solution. In particular, consider the dynamics of the first 
posterior momenip^ which by comparison with the NPF equation (Eq. directly 
motivates this particular choice of the decoding weights. Thereby, the covariance 
cov(x,g(x)^) is estimated empirically from the samples z^Via 


cov(x,g(x)) 


1 ^ 


- T?; 




N 

N2 Y. ' 

kA^l 


g(zi 


(13) 


As a second choice, we consider W = Wml, he. the decoding weight Wt and thus the 
Neural Filter itself is tuned via maximum likelihood. The learning rule is given in 


Eq. (22) and corresponds to an online update of the decoding weight at each time step. 


What is peculiar about this choice is that here a decoding rather than a generative 
model parameter is learned, illustrating that inference and learning can be very 
intertwined: In fact, it is actually possible to rephrase the filtering problem in terms of a 
learning problem by giving an ansatz for a decoder whose parameters have to be learned 
such that it can perform filtering, but we will not consider this rather extreme case here. 


see SI Appendix 
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Dynamics and parameters 

For our simulations, we use a nonlinear hidden dynamics, that was chosen to have a 
bimodal stationary distribution: 


dxt = axt{b — xl) dt-\-cjxdujt- 


(14) 


The parameters a > 0 and 6 > 0 can be used to tune the shape of the bimodal 
distribution, whereby the positions of the two modes is determined by ±& and a defines 
how sharply the distribution is peaked around the modes. Unless stated otherwise, 
parameters for the deterministic dynamics were a = 3 and b = 1, resulting in a bimodal 
distribution with distinct, but not too sharp peeks at ±1, such that it is possible for the 
hidden state to switch from one mode to the other. 

In our examples in Fig. we employ both linear and sigmoid observation dynamics 
g{x), thereby simulating multisensory integration with our model (cf. Eqs. |^and[^. For 
the plots in Fig. we use only one sensory modality per row. The observation noise cr„ 
and (Ta is varied between 10“^ and 300. 

In the multidimensional simulations in Fig. the hidden dynamics within each 
dimension is independent of the other dimensions and and corresponds to Eq. (14), with 
Tlx chosen to be the unit matrix. The stationary distribution of this dynamics is thus a 
multimodal distribution with 2" peaks for n dimensions. The linear generative function 
is given by g(x) = Jx with J a rotation matrix rotating the hidden state vector by 30 
degrees around each spatial axis. In the simulations shown in Fig. Ty = 0.1 • I. 

Mean-squared errors (MSE) and biases of estimated quantities or parameters 9 were 
computed by 


MSE = 


(15) 

Bias{9) = 

-^0* -6»t, 

(16) 


i=l 


where 9^ denotes the true or benchmark (particle filter) value. Unless stated otherwise, 
MSEs are normalized with respect to the trace of the stationary prior variance Tprior to 
make performance comparable and, if needed, independent of the number of hidden 
dimensions. 

Other simulation parameters comprise the time step size, which was set to 
dt = 0.005 throughout all simulations. All simulations were run for SOO’OOO time steps, 
corresponding to 2500 time units. Unless stated otherwise, MSEs and biases were 
averaged over the last 1000 time units, equaling 200’000 time steps. 


Benchmark models 


We would like to stress that the generative model we chose in our examples are 
nonlinear both in prior as well as in observation dynamics. This implies that a closed 
form solution to this problem does not exist, and thus approximations have to be 
employed. Therefore, when assessing the performance of the NPF, we compare it to two 
approximate filtering algorithms, both of which widely used in approximating the 
posterior distribution in nonlinear filtering problems: a standard particle filter 
continuous-time version of the extended Kalman filter 
these algorithms, see|Sl Appendix| 


18 and 


26 . For further information on 
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Maximum likelihood parameter learning 

The cost function 


For the continuous-time continuous state-space generative model given by Eqs. Q and 
(H), learning in the mathematical literature commonly referred to as ‘system 
identification’, is a tough problem which has hardly been looked at. In fact, the (to our 
knowledge) only reference which gives an explicit cost function for identification in our 
setting is a technical report by Moura and Mitter 47 . Based on a change of probability 


measure, they propose a cost function that is equivalent to the log likelihood of the 
input logpe(3^t) with model parameters 9, which for our model reads: 


Lf^‘“(6)) = {g{x,)) ds, (17) 

where expectations are taken with respect to the filtering distribution at time s, 

p{y^s\ys)- 

In a discrete-time approximation, Eq. ( |17[ ) immediately suggests an online 
maximization scheme. Instead of maximizing the cost function at a time t for the whole 
observation sequence implying we would have to run the filter all over again each 
time we change the parameters, we just perform a gradient ascent with respect to the 
parameters 9 on the last contribution to the integral, i.e. to 

L™i‘“(6i) = (g(xt))^E-^ dyt - i(g(xt))^S-i(g(xt)) dt, ([5]) 


where expectations are with respect to the filtering distribution at time t, p{xt\yt). It 
can be shown that maximization of this cost function is equivalent to a minimization of 
the reconstruction error {dyt — g(xt) dt)'^ at each time step (see 
proof). 


SI Appendix for a 


Learning rules 

Parameter learning is implemented by maximizing the log likelihood (Eq. i by a 
gradient ascent with respect to the model parameters 9, giving rise to the following 
online learning rule for the parameters: 

T]g^A9 = (^^(g(xt))) E-^(dyt - (g(xt))dt). (18) 

This approximation of learning is justified if the time scale of learning is much larger 
than the dynamics of the filter, i.e. for small learning rates. 


Equation (18) exhibits a peculiar structure: The novelty signal dyt — (g(xt)) dt is 
multiplied with a parameter gradient on the posterior estimate of the generative 
function (g(xt)). Thus, we have to take into account the implicit change of the posterior 
distribution, the so called filter derivative, with respect to the model parameters. The 
filter derivative is in general hard or even impossible to compute analytically and many 


identification problems deal with estimating it (cf. for instance 19,35 ). 


In our model, we make use of the approximated posterior dynamics in order to 
derive dynamics of the filter derivative for parameter learning. Equation can be 
approximated by taking N samples from the NPF equation ([^ in order to express the 
posterior estimates: 


N 


39 


(g(xt)) 


(g(xt)) 






N 

1 

N 


k=l 

N 




{k),OZ.t 






de 


(19) 

( 20 ) 
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where denotes the Jacobian of the generative function. Thus, the 

filter derivative is taken into account by considering the infinitesimal change in the 
position of sample z^^^ with respect to the change in parameter value 9. 

The single particle filter derivative, —gg—, cannot be computed directly. However, 
based on Eq. Q, it is possible to compute its dynamics: 



d_ 

do 


{d'. 


Ak)\ 


( 21 ) 


Note that every single parameter that is learned has N accompanying filter derivatives 
of this form. 

In this work we are interested in learning the decoding weight matrix Wt and, for a 
linear observation dynamics g(x) = Jx, in learning the generative matrix J, respectively. 
The resulting learning rules for the components of the decoding weight matrix with 
learning rate rjw are given by 


AWij = 77u/g^(g(xt))^Ej^^ (dyi - (g(xt))dt), 


( 22 ) 


with filter derivative dynamics 


'azf' 

dW,, 


f) 

(F(zf ^) - IEG(zf ^)) + [dyt - g(zf (23) 


where Fij{z^^'^) = denotes the Jacobian of the nonlinear hidden dynamics and 

Bi denotes the unit vector in the i-th direction. This implies that, when we take Wt to 
be a plastic decoding weight matrix that is learned as observations become available, at 
least three equations are needed to infer the hidden state at each time step: First, 

Eq. Q to evolve the states of the filter neurons, and second, Eqs. (22) and (23) to 
update the weights in the filter equation. 

Analogously, learning rules for the components of the generative matrix for linear 
observation dynamics g(x) = Jx read 


= 


VJ 


dJij ) 




(dyt - J(xt) dt) + (F,y ^ (dyt - J(xt) dt)j 


.(24) 


In addition to a term proportional to the filter derivative, the learning rule contains a 
second term that emerges from an explicit dependence of the likelihood in the 
generative weight. Filter derivatives are given by 


dz 


(fe)' 


dJi, 


(F(zf)) - lEj) ^dt - x[';jWedt 


(k) 


(25) 


Bias due to sampling In the derivation of these learning rules, we used a 
sampling-based representation of the approximated posterior in order to estimate 
log-likelihood and gradients. This introduces a bias in these estimations, which we 
should correct for when computing parameter estimates. Unfortunately, this bias is in 
general not analytically accessible, but at least for a linear generative model it can be 


shown that it vanishes with 1/A^ (for a proof, see SI Appendix). 
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Approximation for small observation noise The learning rules we obtain for the 
decoding weights Wt and the generative weights J are not local, implying that the 
weights can only be computed when knowing the state of each filter neuron at each time. 
However, for small observation noise the learning rule for the generative weight J can 
be approximated by a local learning rule with a Hebbian structure. First, we can 
neglect the filter derivative, which decays to zero very fast because in this limit, the 


decoding weight Wt is generally large (cf. Eq. 25), and thus the first term in Eq. (24) 
vanishes. Second, because in this limit the posterior will approach a ^-distribution 
around the true hidden state x, as does the approximated posterior, we can 
approximate the learning rule for J by: 


^7'AJ 


oc (dyt - J(xt) dt){-Kt)'^ « {{dyt - Jxt)xf), 


(26) 


which takes the form of a local Hebbian learning rule. 


Supporting Information 

SI Appendix. The Neural Particle Filter: Mathematical Appendix. 
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